ABSTRACT

A subset of a real data set from a national survey on 
alcohol use and driving in Canada (original sample of 6,457) is used 
to illustrate an ad hoc analysis with missing data on multiple 
response variables. A complete-case analysis initiates this strategy, 
determining variables that may be deleted without losing effects of 
interest. By such a deletion, the number of complete observation 
vectors may very well increase. Also illustrated are two 
straightforward imputed means analyses. All illustrations are given 
in the context of predictive discriminant analysis. As discussed, the 
ad hoc strategy may be applicable to other multivariate contexts. 
Four tables illustrate steps in the analysis. (Contains 11
references.) (SLD)
Abstract

A subset of a real data set is used to illustrate an ad hoc
analysis with missing data on multiple response variables. This
strategy is initiated with a complete-case analysis to determine some
variable(s) that may be deleted with no loss in effects of interest.
By such a deletion, the number of complete observation vectors may very
well increase. Also illustrated are two straightforward imputed means
analyses. All illustrations are given in the context of predictive
discriminant analysis.
An Ad Hoc Analysis Strategy
with Missing Data

The purpose of this paper is to present an ad hoc analysis
strategy with missing data for use in a predictive discriminant
analysis context, and to illustrate the strategy using a subset of a
real data set. Following the illustration, the data set is
subjected to two imputed means analyses, again for illustrative
purposes.

Before presenting and illustrating the proposed analysis strategy,
the missing data problem is briefly reviewed. Consider a data matrix
of N (total number of units) rows and p+1 columns (p response
variables plus one grouping variable). Sometimes there are fewer than
the total possible N·p response measures; this is an example of a
missing data problem. If there is a row with more than, say, p/2
missing measures, [...]; the same goes for a column (i.e., variable)
with more than [...] missing measures. Even subsequent to such
deletions, there may be additional missing measures, and the
researcher needs to utilize the "working" data matrix (of N rows) in
some analysis. One possible analysis is an available-case analysis.
With this analysis, parameters are estimated using all available
data; this involves [...] and is outside the domain of most readily
available computer software (see Hand, 1981, pp. 191-193).

Two analysis strategies often considered by practitioners with
the final data set of interest are: (a) conduct the analysis of
choice using only the complete rows [say, N* (< N) rows], and
(b) impute values for the missing measures [...]. The complete-case
strategy may be appropriate when the percent of missing values is
"low." A complete-case analysis is [...] computer software (e.g., SAS
and SPSS). [The percent of missing values may be calculated for the
total data matrix (N·p measures) or for each group of units.]
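The two quantities just mentioned, the percent of missing values (overall or per group) and the number N* of complete rows, can be computed directly. The following is a minimal sketch on hypothetical toy data; the row layout (a group label followed by a list of scores, with `None` marking a missing score) and the function names are assumptions made for illustration only.

```python
# Hypothetical toy data: each row is (group, [scores]); None marks a missing score.
rows = [
    (1, [2.0, None, 5.0]),
    (1, [1.0, 3.0, 4.0]),
    (2, [None, None, 6.0]),
    (2, [2.5, 3.5, 4.5]),
]


def percent_missing(data):
    """Percent of missing scores among all N*p measures in `data`."""
    total = sum(len(scores) for _, scores in data)
    missing = sum(s is None for _, scores in data for s in scores)
    return 100.0 * missing / total


def complete_cases(data):
    """Keep only the rows with no missing score (the complete-case matrix)."""
    return [(g, scores) for g, scores in data if None not in scores]


overall = percent_missing(rows)                      # total data matrix
by_group = {g: percent_missing([r for r in rows if r[0] == g]) for g in (1, 2)}
n_star = len(complete_cases(rows))                   # N* complete rows
```

With these toy rows, `overall` covers all twelve measures, `by_group` repeats the calculation within each group, and `n_star` counts the rows a complete-case analysis would retain.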

Data imputation can be fairly simple or fairly complex. Four
methods of imputation in a predictive discriminant analysis (PDA)
context that have been studied are:

(1) Use the mean of all available scores in the respective
    groups or use the total-group means (see Hufnagel, 1988;
    Jackson, 1968);

(2) Substitute [...] principal component analysis of the [...]
    pooled covariance matrix (see Chan, Gilman, & Dunn, 1976);

(3) Use the expectation-maximization (EM) algorithm, as
    described by Johnson and Wichern (1992, pp. 202-[...]) and
    Twedt and Gill (1992); and

(4) Let the missing value of the ith predictor, X_i, be
    determined as in (1) and similarly replace missing values
    for all other X's; then, using X_i as a criterion variable,
    conduct a multiple regression with the other X's as
    predictors and use the predicted value of X_i as an imputed
    value; as described by Hufnagel (1988), extensive iterations
    may be incorporated.
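Method (1), the simplest of the four, can be sketched as follows. This is an illustrative sketch only: the data layout (group label plus a score list, with `None` for a missing score) and the function names are assumptions, and the code expects every variable to have at least one available score in each group.

```python
def column_mean(data, j):
    """Mean of the available (non-missing) scores on variable j."""
    vals = [scores[j] for _, scores in data if scores[j] is not None]
    return sum(vals) / len(vals)  # assumes at least one available score


def impute_means(data, separate_groups=True):
    """Replace each missing score by a separate-group or total-group mean."""
    p = len(data[0][1])
    groups = sorted({g for g, _ in data})
    if separate_groups:
        means = {
            g: [column_mean([r for r in data if r[0] == g], j) for j in range(p)]
            for g in groups
        }
    else:
        total = [column_mean(data, j) for j in range(p)]
        means = {g: total for g in groups}
    return [
        (g, [means[g][j] if s is None else s for j, s in enumerate(scores)])
        for g, scores in data
    ]


# Hypothetical toy data: two groups, two predictors.
toy = [
    (1, [2.0, None]),
    (1, [4.0, 6.0]),
    (2, [None, 1.0]),
    (2, [10.0, 3.0]),
]
separate = impute_means(toy, separate_groups=True)
total_group = impute_means(toy, separate_groups=False)
```

The two calls differ only in which mean fills a gap: the mean of the unit's own group, or the mean over all units regardless of group.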

Over the past 25 years or so, a number of data simulation studies
have been reported that compare methods of handling missing data in a
PDA context. Most of these studies focused on the two-group PDA
situation. Four two-group studies spanning the last four decades are
now briefly reviewed. Jackson (1968) compared the complete-case method
with methods (1) and (4) mentioned above, and concluded that method (4)
was best (in the sense of predictive accuracy), but not appreciably
better than method (1); it was not clear as to which mean was used.
Chan et al. (1976) found that the complete-case method was "surprisingly
good" (p. 844) for p = 2 and p = 4, and "it is comforting to find that
[method (1) with separate-group means] performs reasonably well" (p.
84[?]) relative to methods (2) and (4). The conclusions Hufnagel (1988)
drew were conditioned on predictor intercorrelations, number (p) of
predictors, and proportion of missing observations. It was concluded
that "in the case of large correlations [the complete-case method and
method (4)] can be recommended best" but "[method (2)] could be used
if small proportions of missing observations are given" (p. 74).
Twedt and Gill (1992) concluded from their simulations that differences
among methods (1) (which mean used is not clear), (2), and (3) were
slight, and that it is "better to replace missing data than to delete
the observation vectors with missing data" (p. 1577).



Group membership prediction rules typically used are normal-based
rules. The effect of (random or nonrandom) missing data on
multivariate normality is a complicated issue that is not discussed
here (see Murty & Federer, 1991). In the ensuing discussion it is
assumed that multivariate normality is a condition that is reasonably
met.

With the proposed ad hoc analysis, no data imputation is involved.
This strategy may be described in the context of predictive
discriminant analysis (PDA) as follows. Let there be N* (< N) units
for which there are complete p-dimensional observation vectors. With
an N*-by-(p+1) data matrix, a complete-case PDA is conducted. From
this analysis it may be reasonable to conclude that one of the p
predictor variables may be deleted with little loss, or in fact a gain,
in predictive accuracy. Moreover, there may be units with missing data
on the deleted predictor, but with complete data on the other p-1
predictors. One can then return to the original data matrix and
determine a new data matrix of N** rows and (p+1)-1 columns [...].
Again, a "weak" variable may be deleted; and so on. It is
recognized that multiple decisions would need to be made on different
data matrices. Needless to say, judgment and reasonableness would need
to be exercised. [Sometimes it may be judged reasonable to delete more
than one predictor at a time. Suppose one considers deleting two
predictors. Then what might be done is to conduct $\binom{p}{2}$
analyses to determine which pair of predictors could be advisably
deleted. Continuing to look for the next two predictors to drop, one
would conduct $\binom{p-2}{2}$ analyses; and so on.]
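The bookkeeping behind this strategy, recounting the complete rows after a candidate predictor is set aside and counting the analyses needed when predictors are dropped in pairs, can be sketched briefly. The toy matrix and the name `n_complete` are hypothetical; the choice of which predictor to delete would still rest on the hit rates from the discriminant analyses, which this sketch does not compute.

```python
from math import comb


def n_complete(data, dropped=()):
    """Number of rows that are complete on every predictor not in `dropped`."""
    return sum(
        all(s is not None for j, s in enumerate(scores) if j not in dropped)
        for scores in data
    )


# Hypothetical toy matrix: 4 units by 3 predictors, None marks a missing score.
data = [
    [1.0, 2.0, None],
    [1.5, None, None],
    [0.5, 2.5, 3.0],
    [2.0, 1.0, None],
]
p = 3

n_star = n_complete(data)                                  # complete on all p
after_drop = {j: n_complete(data, dropped=(j,)) for j in range(p)}

# Number of analyses when considering pairs of predictors, for p = 13:
first_round = comb(13, 2)    # choose the first pair to delete
second_round = comb(11, 2)   # choose the next pair from the remaining 11
```

Here dropping the third predictor raises the complete-case count from `n_star` to `after_drop[2]`, which is the kind of gain the strategy looks for once a predictor has been judged expendable.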



Data. A data set was obtained via a national telephone survey in
Canada. The survey dealt with alcohol consumption and automobile
driving. Two types of drivers were determined: Group 1 consisted of
those who did drive after drinking; and Group 2 consisted of those who
did not drive after drinking. The purpose of one study that utilized
this data set (DeJoy, Huberty, & Shewokis, 1993) was to develop a rule
[...] were considered. The initial data matrix had 6816 rows and 14
columns; one column was simply a group membership indicator. Four
13x13 correlation matrices [...] so high as to conclude there was
extensive variable redundancy, all 13 predictors were retained for
study. A total of 1705 drivers comprised Group 1 whereas 5111 drivers
were in Group 2.

There were 359 drivers (62 in Group 1 and 297 in Group 2) who had
missing score values on at least one of the 13 predictors. For the 359
drivers, [...] scores (9.9%) were missing in Group 1 and [...] scores
[...]. [...] missing values had 34 missing values in Group 1 and 172
in Group 2.

Prior to selecting the additional drivers from each original
group, potential outliers [...] DISCRIMINANT program (Release 4.0). A
quadratic classification rule was used with [...] a typicality
probability of [less than] .05, then that driver was considered a
potential outlier.

[...] retain the approximate [...] the 497(13) Group 2 values are
missing; across the [...] 659(13) values, about 7.6% are missing.
Counts for the original sample and for the subset selected for this
study are summarized in Table 1.
Table 1
Group Counts

                          Group 1    Group 2    Total

Original Units
   Complete vectors         1643       4814      6457
   (Outliers                  65      [...]     [...])
   Incomplete vectors         62        297       359
   Total                    1705       5111      6816

Selected Units
   Complete vectors          100        200       300
   Incomplete vectors         62        297       359
   Total                     162        497       659



Analysis. It was necessary to make some decisions about the
specifics of the PDA techniques to be used. For the group sizes
involved and the two group covariance matrices (with outliers deleted,
the Box test yielded P = .0000), it was decided that a quadratic
classification rule was to be used (see Huberty, in press, Section
IV-3). On the basis of familiar previous research data, prior
probabilities of .30 for Group 1 and .70 for Group 2 were judged to be
[...]. [...] probability of .60 was utilized; a driver had to "yield"
a posterior probability of group membership of at least .60 to be
assigned to one group or the other. Finally, [...] rate of interest
for the current study is that for Group 1. The [...] procedure in the
SAS package (Release 6.07) was used.
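The assignment step described here (prior probabilities of .30 and .70, and a minimum posterior probability of .60 before a driver is assigned) can be sketched as below. This is a simplified illustration, not the routine actually run: the group density values are taken as given inputs, whereas in a quadratic rule they would come from multivariate normal densities with separate group covariance matrices. The function name `assign` is hypothetical.

```python
def assign(densities, priors=(0.30, 0.70), threshold=0.60):
    """Return (group, posteriors); group is None if no posterior reaches threshold.

    `densities` holds the density of the unit's score vector under each group
    (assumed to be supplied by the quadratic rule; not computed here).
    """
    joint = [f * q for f, q in zip(densities, priors)]
    total = sum(joint)
    posteriors = [j / total for j in joint]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    group = best + 1 if posteriors[best] >= threshold else None
    return group, posteriors


clear_g1, _ = assign((0.8, 0.2))    # density strongly favors Group 1
unassigned, _ = assign((0.5, 0.2))  # neither posterior reaches .60
clear_g2, _ = assign((0.2, 0.2))    # equal densities: the priors decide
```

A unit for which neither posterior reaches the threshold is left unassigned, which is why the third return state (`None`) is kept separate from the two groups.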

[...] respectively [...] may vary across the 13 analyses. [...]
(i.e., highest hit rate) is associated with a deleted [...] hit rates
associated with two or more deleted variables are "close."

Results. A summary of all steps in the analysis process is given
in Table 2. The 13 12-variable analyses indicated that deleting V10
would actually increase the Group 1 L-O-O hit rate (from .460 to .512).
By deleting the least important variable, V10, the number of rows in
the complete-case matrix increases from 300 to 420.
It turns out that by deleting three of the 13 predictors (one at a
time), the number of drivers for whom complete score vectors were
available increased from 300 to 461 while the Group 1 hit rate
"stabilized" at about .53. By deleting a fourth variable, no
appreciable increase in the number of complete vectors resulted without
a drop in the Group 1 hit rate. So, with this data set, one could
reasonably utilize a complete-case data set having 461 rows out of a
total of 659 rows with the ad hoc analysis strategy. In utilizing this
data set there would be a 28% increase in the number of Group 1
drivers, a 66.5% increase in the number of Group 2 drivers, and a 53.7%
increase in the total number of drivers.
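The percentage increases quoted above follow from the complete-case counts reported in Table 2 (100, 200, and 300 units before any deletion; 128, 333, and 461 after three deletions). A short check of the arithmetic, with hypothetical variable names:

```python
# Complete-case counts before any deletion and after deleting three predictors.
before = {"Group 1": 100, "Group 2": 200, "Total": 300}
after = {"Group 1": 128, "Group 2": 333, "Total": 461}

# Percent increase in complete observation vectors, rounded to one decimal.
increase = {
    k: round(100.0 * (after[k] - before[k]) / before[k], 1) for k in before
}
```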

One might ask: "Why not use an analysis that utilizes all of the
drivers on whom you have partial or complete observation vectors?" To
do so, one could use some method of data imputation. We discuss such
an approach next.

Imputed Means Analyses 

As mentioned earlier, two types of means may be used for
imputation purposes: total-group means and separate-group means. We
judge the latter to better approximate the "real" observations (that
are missing) and, therefore, focus on them. As indicated in the brief
review in the introduction of this paper, the simple imputation method
of replacing missing observations with means fares pretty well when
compared with more complicated imputation methods (at least for the
two-group context).

Imputed means may be utilized in a predictive discriminant
analysis in two ways: (i) Impute all missing observations using means
to complete the data matrix, determine a rule using this completed data
matrix, and assess the rule using an L-O-O analysis; and (ii) Determine
a classification rule using only the complete observation vectors, and
then apply the rule to the data matrix with all missing observations
imputed with means so as to arrive at hit rate estimates.
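The L-O-O (leave-one-out) assessment named in way (i) can be sketched as follows. To keep the example short and self-contained, a nearest-group-mean classifier stands in for the quadratic rule; that substitution, the toy data, and all function names are assumptions for illustration, not the procedure used in the analyses reported here.

```python
def centroid(vectors):
    """Mean vector of a list of equal-length score vectors."""
    p = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(p)]


def classify(x, centroids):
    """Assign x to the group with the nearest centroid (stand-in rule)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda g: dist2(centroids[g]))


def loo_hit_rate(data, target_group):
    """Leave-one-out hit rate for the units of `target_group`.

    Each unit is held out in turn, the rule is rebuilt from the remaining
    units, and the held-out unit is classified with that rule.
    """
    hits = n = 0
    for i, (g, x) in enumerate(data):
        if g != target_group:
            continue
        rest = data[:i] + data[i + 1:]
        labels = {k for k, _ in rest}
        centroids = {h: centroid([v for k, v in rest if k == h]) for h in labels}
        hits += classify(x, centroids) == g
        n += 1
    return hits / n


# Hypothetical completed (already imputed) data: (group, score vector).
completed = [
    (1, [0.0, 0.0]), (1, [1.0, 0.0]), (1, [5.0, 5.0]),
    (2, [4.0, 4.0]), (2, [5.0, 4.0]), (2, [4.0, 5.0]),
]
group1_rate = loo_hit_rate(completed, target_group=1)
group2_rate = loo_hit_rate(completed, target_group=2)
```

Because the held-out unit never contributes to the rule that classifies it, an L-O-O hit rate is an external estimate, in contrast to an internal hit rate obtained by reclassifying the same units used to build the rule.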

Both imputed means analyses were conducted on the data set
described above, using the SAS package (Release 6.07).

Table 2

No.         Predictor   Group 1 L-O-O     Complete Case Nos.
Predictors  Deleted     Hit Rate          G1     G2     Total

13          (none)      .460              100    200    300
12          V10         .512              125    295    420
11          V4          .52[?]            127    327    454
10          V7          .531              128    333    461

Imputed Means Analysis (i). Separate-group means were calculated
using available data for each response variable for which there were
[...] missing scores. This was accomplished using the MEANS procedure
from SAS. In the original N x (p+1) data matrix, each missing data point
for a predictor was supplanted by the corresponding group mean of the
predictor. The resulting augmented data matrix (N = 659) was used
for the following analysis. First, a 13-predictor PDA was conducted (N
= 659) using a quadratic L-O-O analysis. Second, 13 12-predictor
analyses were conducted to determine if, by deleting a predictor,
an increase in the Group 1 hit rate (relative to the 13-predictor hit
rate) would result. It turned out that by deleting V7 the Group 1
L-O-O hit rate increased from .506 to .518. Third, 12 11-predictor
analyses were conducted. It turned out that by deleting V2 or V4 or
V9 the Group 1 L-O-O hit rate remained at .518. Thus, three sets of
11 10-predictor analyses were conducted, one set with V7 and V2
deleted, one with V7 and V4 deleted, and one with V7 and V9 deleted.
It turned out that by deleting that last pair along with V1, the Group
1 L-O-O hit rate increased to .525; this was a greater increase than
those resulting from the sets of analyses with the other two pairs
deleted. A summary of the analyses is presented in Table 3.

Table 3

Results of Imputed Means Analysis (i)

No.         Predictor   Group 1 L-O-O
Predictors  Deleted     Hit Rate

13          (none)      .506
12          V7          .518
11          V9          .518
10          V1          .525

Imputed Means Analysis (ii). With this method, a classification
rule is built using the 300 complete observation vectors. Then this
rule is applied to the data matrix with 659 rows wherein separate-group
means replace the respective missing variable values. This type of
analysis may be carried out using the SPSS DISCRIMINANT
program (with the keyword MEANSUBSTITUTION), except that total-group
instead of separate-group means would be used as imputed values.
[...] complete observations and subsequently apply the rule to [...]
data [...] compared with those obtained via the two previous analysis
approaches.

Results of this imputed means analysis of the data set are
given in Table 4. It is not too surprising that the internal hit
rates are higher than the corresponding L-O-O hit rates using
analysis (i) and using the ad hoc analysis.

Table 4

Results of Imputed Means Analysis (ii)

No.         Predictor   Group 1
Predictors  Deleted     Hit Rate

13          (none)      .531
12          V4          .543
11          V9          .549

Discussion

The intent of presenting the three analysis strategies was not to
compare them in any empirical sense [...]. [...] they may be used in
a situation involving three or more groups in a similar manner.

The ad hoc strategy in general may be applicable in other
multivariate contexts; for example, [...] (or multiple [...]). If the
strategy were used in a [...] context, two questions would need to be
addressed: (1) What [...], and (2) What [...], consonant with the
selected interpretation, is to be used? The answers to these
questions would presumably determine the dimensions of the data sets
to be used in the analysis strategy. With the data set used earlier
in this paper, for example, it would have to be decided if the
initial analyses should be based on a data matrix with 300 rows or
with varying numbers of rows if multiple analyses are carried out in
the process. The above two questions would also have to be considered
in other analysis contexts; see Huberty (1989) for a discussion on
variable ordering.

The general philosophy behind the proposed ad hoc analysis
strategy is: Do the best with what you have. Some might argue that
the available data may be utilized in estimating missing scores, which
would result in a more acceptable analysis. Perhaps. Some data
imputation methods assume randomly missing data. In a given study, how
"good" imputed data are is virtually unknown. Another type of question
that may be asked in a context like that used herein is: What is to
say that a variable would be assessed as an unimportant predictor if
more measures on the variable were available? Of course, an ad hoc
analysis strategy such as that proposed here is expected to work with
varying proficiency across different data sets. It may very well be an
analysis strategy of choice for some real data sets.